Advanced Quantitative Methods

Week 3: More on categorical data

Ozan Aksoy

University of Oxford

Aims for today

  • Model fit & model comparison

  • Mediation in logit models

  • Multinomial regression

Packages & data

# Load packages
library(here)
library(tidyverse)
library(broom)
library(marginaleffects)
# khb package - see details on installing later
# lmtest package - to compare models
# modelsummary package - show models side-by-side in a table
# nnet package - for multinomial logistic regression


df <- read_csv(here("data", "bhps2001smoking.csv"))
dfclean <- drop_na(df)
dfclean$income <- dfclean$income/1000
head(df, 4)
# A tibble: 4 × 9
  smoker primsec female   age income married divsep widow nevermar
   <dbl>   <dbl>  <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>
1      0       1      1    68  8409.       0      0     1        0
2      0       0      1    59  5596.       0      1     0        0
3      0       1      1    69  9814.       0      0     0        1
4      0       0      0    40 29710.       0      0     0        1

Comparing two OLS models

ols1 <- lm(income ~ age + female, data = dfclean)
ols2 <- lm(income ~ age + female + primsec, data = dfclean)
summary(ols1)

Call:
lm(formula = income ~ age + female, data = dfclean)

Residuals:
   Min     1Q Median     3Q    Max 
-23.87  -6.78  -2.26   3.55 496.05 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.63013    0.65311   43.84   <2e-16 ***
age         -0.18302    0.01146  -15.97   <2e-16 ***
female      -8.47546    0.37266  -22.74   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.19 on 5871 degrees of freedom
Multiple R-squared:  0.1191,    Adjusted R-squared:  0.1188 
F-statistic: 396.9 on 2 and 5871 DF,  p-value: < 2.2e-16

R-square

R-square: the share of total variance explained by the model (model variance / total variance)

var(ols1$fitted)/var(dfclean$income)
[1] 0.119115

Read it off the summary output or extract it from the model object:

summary(ols1)$r.squared
[1] 0.119115
summary(ols1)$adj.r.squared
[1] 0.1188149

And for the second model:

var(ols2$fitted)/var(dfclean$income)
[1] 0.1644643
summary(ols2)$adj.r.squared
[1] 0.1640373
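The adjusted R-square can also be computed by hand from the plain R-square, the sample size, and the number of predictors. A minimal sketch on simulated data (hypothetical, not the BHPS data used above):

```r
# Adjusted R^2 by hand on simulated data (hypothetical example)
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
m <- lm(y ~ x)
r2 <- summary(m)$r.squared
k <- 1  # number of predictors (excluding the intercept)
adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
all.equal(adj, summary(m)$adj.r.squared)  # TRUE
```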

Residual standard error

Roughly the average deviation of observed from predicted values, in the original scale of y. An absolute measure, not a relative one like R-square.

# RSE by hand:
sqrt(sum(residuals(ols1)^2)/(ols1$df.residual))
[1] 14.19261
sqrt(sum(residuals(ols2)^2)/(ols2$df.residual))
[1] 13.82363
# note that sum(residuals(ols2)^2) = RSS

# same scale as original
# what is the sd of y in the data?
sd(dfclean$income)
[1] 15.1192

Comparing nested models

Is it worth adding education and then marital status?

ols3 <- lm(income ~ age + female + primsec + divsep + nevermar + widow, data = dfclean)
tidy(ols3)
# A tibble: 7 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   31.4      0.746      42.1  0        
2 age           -0.164    0.0132    -12.4  4.57e- 35
3 female        -8.24     0.369     -22.3  7.26e-106
4 primsec       -6.87     0.380     -18.1  3.38e- 71
5 divsep         1.01     0.585       1.73 8.30e-  2
6 nevermar      -1.74     0.545      -3.19 1.43e-  3
7 widow          3.08     0.691       4.46 8.20e-  6
sqrt(sum(residuals(ols1)^2)/(ols1$df.residual))
[1] 14.19261
sqrt(sum(residuals(ols2)^2)/(ols2$df.residual))
[1] 13.82363
sqrt(sum(residuals(ols3)^2)/(ols3$df.residual))
[1] 13.78746

Comparing nested models - adj R2

summary(ols1)$adj.r.squared
[1] 0.1188149
summary(ols2)$adj.r.squared
[1] 0.1640373
summary(ols3)$adj.r.squared
[1] 0.1684067

F-test

We can use the F-test to compare nested models

anova(ols1, ols2, ols3)
Analysis of Variance Table

Model 1: income ~ age + female
Model 2: income ~ age + female + primsec
Model 3: income ~ age + female + primsec + divsep + nevermar + widow
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1   5871 1182597                                   
2   5870 1121715  1     60882 320.273 < 2.2e-16 ***
3   5867 1115282  3      6433  11.281 2.239e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ols4 <- lm(income ~ smoker, data = dfclean)
anova(ols4, ols2)  # not nested!!
Analysis of Variance Table

Model 1: income ~ smoker
Model 2: income ~ age + female + primsec
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1   5872 1339197                                  
2   5870 1121715  2    217482 569.05 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about logistic regression models?

glm1 <- glm(smoker ~ female + age + primsec + income + divsep + widow + nevermar, 
              family=binomial(link='logit'), data = df) 
summary(glm1)

Call:
glm(formula = smoker ~ female + age + primsec + income + divsep + 
    widow + nevermar, family = binomial(link = "logit"), data = df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  7.927e-02  1.625e-01   0.488   0.6256    
female      -1.889e-01  6.965e-02  -2.712   0.0067 ** 
age         -2.924e-02  2.483e-03 -11.777  < 2e-16 ***
primsec      5.690e-01  7.109e-02   8.004 1.20e-15 ***
income      -1.724e-05  3.396e-06  -5.077 3.84e-07 ***
divsep       8.564e-01  9.292e-02   9.217  < 2e-16 ***
widow        2.604e-01  1.338e-01   1.946   0.0517 .  
nevermar     5.231e-01  8.935e-02   5.855 4.78e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6425.8  on 5873  degrees of freedom
Residual deviance: 6041.8  on 5866  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 6057.8

Number of Fisher Scoring iterations: 4

Model fit for logistic regression

  • Deviance = distance between actual fit and perfect fit

  • A perfect model predicts every case as 0 or 1 as observed; it has a deviance of 0

  • Deviance can range from 0 to infinity

  • What to compare our model with? Perfect isn’t realistic
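The deviance logic above can be sketched with simulated data (hypothetical, not the BHPS data): adding a useful predictor moves the residual deviance down, i.e. closer to the perfect fit.

```r
# Sketch on simulated data: deviance measures distance from a perfect fit,
# so it falls when a genuinely predictive variable is added.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))
m <- glm(y ~ x, family = binomial(link = "logit"))
m$null.deviance  # deviance of the intercept-only model
m$deviance       # residual deviance of our model: smaller, closer to perfect
```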

Null model

Null model has an intercept only, describing the mean for all cases

null <- glm(smoker ~ 1, family=binomial(link='logit'), 
            data = dfclean)

Calculate the mean:

tidy(null) # intercept = logodds of smoking
# A tibble: 1 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    -1.17    0.0307     -38.2 8.06e-319
1 / (1 + exp(-null$coefficients[1])) # logodds to prob
(Intercept) 
  0.2364658 

Null model

This checks out:

prop.table(table(df$smoker))

        0         1 
0.7630952 0.2369048 


  • Can we do better than this by adding independent variables?

  • Can we get closer to a saturated model? (i.e. \(n-1\) predictors plus an intercept: one parameter per observation, a perfect fit)

Residuals in logit

  • Raw Residuals \(e_i= y_i - \hat{p}_i\) run from -1 to 1

  • Pearson Residual and Standardized Pearson Residuals (SPR)

    • \(r_i^{(P)} = \frac{y_i - \hat{p_i}}{ \sqrt{\hat{p_i}(1-\hat{p_i})} }\)
    • Rescaled & adjusted for leverage
    • if the model fits, SPRs approximately follow a standard normal distribution
  • Deviance residuals
    • Distance between the fitted model and the saturated (perfect) model
    • \(d_i = \mathrm{sign}(e_i)\left[-2\left(y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right)\right]^{1/2}\)
    • \(\sum{d_i^2}\) = the deviance statistic, the logit analogue of the residual sum of squares
    • \(D_m = -2\log\left(\frac{L_m}{L_s}\right) = -2(\log L_m - \log L_s)\)
    • minimising deviance = maximising likelihood
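The identity between summed squared deviance residuals and the deviance statistic can be verified directly. A small check on simulated data (hypothetical, not the BHPS data):

```r
# Check: the squared deviance residuals sum to the deviance reported by glm()
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
m <- glm(y ~ x, family = binomial)
d <- residuals(m, type = "deviance")
all.equal(sum(d^2), deviance(m))  # TRUE
```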


Comparing logistic models

Compare nested logistic models with a Likelihood Ratio \(\chi^2\) test (aka: LR, \(L^2\), \(G_m\))

m1 <- glm(smoker ~ primsec + age, family=binomial(link='logit'), 
            data = dfclean)
library(lmtest)
lrtest(null, m1)
Likelihood ratio test

Model 1: smoker ~ 1
Model 2: smoker ~ primsec + age
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -3212.9                         
2   3 -3085.3  2 255.15  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-nested models and accounting for complexity

It is easier to predict accurately with more variables

AIC/BIC penalise adding parameters: \(k\) is the number of estimated parameters

Akaike Information Criterion: \(-2log(L_m) + 2k\)

Bayesian IC: \(-2log(L_m) + k * log(n)\)
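Both criteria can be reproduced by hand from the log-likelihood. A minimal sketch on simulated data (hypothetical, not the models fitted above):

```r
# AIC and BIC by hand, matching R's AIC() and BIC()
set.seed(2)
y <- rbinom(500, 1, 0.25)
m <- glm(y ~ 1, family = binomial)
ll <- as.numeric(logLik(m))
k <- attr(logLik(m), "df")  # number of estimated parameters
n <- nobs(m)
-2 * ll + 2 * k             # equals AIC(m)
-2 * ll + k * log(n)        # equals BIC(m)
```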


BIC(null, m1) # Or AIC() for AIC
     df      BIC
null  1 6434.502
m1    3 6196.712

Raftery 1995 on \(\Delta\)BIC: 0-2 weak, 2-6 positive, 6-10 strong, 10+ very strong

Comparing two nested models

Compare nested models with a Likelihood Ratio \(\chi^2\) test (aka: LR, \(L^2\), \(G_m\))

m2 <- glm(smoker ~ primsec + age + income, family=binomial(link='logit'), 
            data = dfclean)
lrtest(m1, m2)
Likelihood ratio test

Model 1: smoker ~ primsec + age
Model 2: smoker ~ primsec + age + income
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   3 -3085.3                         
2   4 -3073.6  1 23.442  1.287e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
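The LR statistic that lrtest() reports is just twice the difference in log-likelihoods, compared to a \(\chi^2\) distribution. A sketch on simulated data (hypothetical, not the smoking models above):

```r
# LR chi-square test by hand, mirroring lmtest::lrtest()
set.seed(3)
n <- 800
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x1 + 0.4 * x2))
a <- glm(y ~ x1, family = binomial)       # restricted model
b <- glm(y ~ x1 + x2, family = binomial)  # full model
lr <- 2 * (as.numeric(logLik(b)) - as.numeric(logLik(a)))
pchisq(lr, df = 1, lower.tail = FALSE)    # p-value for the extra parameter
```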

Mediation

Mediation in OLS

Add \(z\) to a model with \(x\) and compare the change in \(\beta_x\) between the two models

How much of the effect of gender on income is due to education?

library(modelsummary)
models <- list(ols1,ols2)
modelsummary(models, fmt = 2,
             statistic = NULL,
             estimate = "{estimate} ({std.error}) {p.value}",
             gof_map = c("nobs","adj.r.squared"))
(1) (2)
(Intercept) 28.63 (0.65) <0.01 29.51 (0.64) <0.01
age -0.18 (0.01) <0.01 -0.13 (0.01) <0.01
female -8.48 (0.37) <0.01 -7.88 (0.36) <0.01
primsec -6.79 (0.38) <0.01
Num.Obs. 5874 5874
R2 Adj. 0.119 0.164

Mediation in OLS

How much of the effect of gender on income is due to education?

# Effect of female without controlling = 
round(ols1$coefficients["female"],2)
female 
 -8.48 
# Effect of female after controlling =
round(ols2$coefficients["female"],2)
female 
 -7.88 
# Controlling for education closes the gap by 
round(ols2$coefficients["female"]-ols1$coefficients["female"],2)
female 
   0.6 

Or in relative terms:

# relative change, in %
(ols1$coefficients["female"]-ols2$coefficients["female"])/ols1$coefficients["female"]*100
  female 
7.081972 
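In OLS the change-in-coefficient approach is exact: the drop in \(\beta_x\) equals the product of the two mediation paths (effect of \(x\) on \(z\), times effect of \(z\) on \(y\)). A simulated illustration (hypothetical data, not the BHPS example above):

```r
# OLS mediation identity: difference of coefficients = product of paths
set.seed(6)
n <- 500
x <- rnorm(n)
z <- 0.6 * x + rnorm(n)              # mediator depends on x
y <- 1 + 0.4 * x + 0.7 * z + rnorm(n)
m_red  <- lm(y ~ x)                  # without the mediator
m_full <- lm(y ~ x + z)              # with the mediator
change  <- coef(m_red)["x"] - coef(m_full)["x"]
product <- coef(m_full)["z"] * coef(lm(z ~ x))["x"]
all.equal(unname(change), unname(product))  # TRUE
```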

Mediation in logistic regression

Let’s not do the same in logistic regression

🚨 Even if \(Z\) is unrelated to \(X\), \(\beta_x\) will change

In OLS, what happens when we add \(Z\)?

  • \(Y = \beta_0 + \beta_1 X + \varepsilon\)
  • \(Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon\)
  • total variance = model variance + residual variance
  • var(Y) = var(model) + var(residuals)
  • var(Y) = var(\(\beta_0 + \beta_1 X\)) + var(\(\varepsilon\))
  • var(Y) = var(\(\beta_0 + \beta_1 X + \beta_2 Z\)) + var(\(\varepsilon\))
  • Add \(Z\): model variance goes up, residual variance goes down, total stays the same

What happens to the variances in a logistic setting?

Mediation in logistic regression

Write logistic regression as a latent variable model for \(y^*\)

\(y^*\) simply means that if \(y^* \geq 0\), \(y = 1\), and if \(y^* < 0\), \(y = 0\)

\[y^* = \beta_0 + \beta_1 X + \varepsilon\] where \(\varepsilon\) has a standard logistic distribution, with a mean of 0 and a variance of \(\pi^2/3 \approx 3.29\)

\[y^* = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon\]

\(var(y^*) = var(\beta_0 + \beta_1 X) + \pi^2/3\)
\(var(y^*) = var(\beta_0 + \beta_1 X + \beta_2 Z) + \pi^2/3\)

Add \(Z\):
- \(var(y^*)\) has to increase because the residual variance is fixed at \(\pi^2/3\), hence model variance goes up
- \(y^*\) is scaled differently in nested models

Question: what happens when one assumes \(\varepsilon\) has a standard normal distribution? What variance would it then have?
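The rescaling described above can be checked by simulation (hypothetical data, not the BHPS example): even a \(Z\) that is completely unrelated to \(X\) changes \(\beta_x\) when it predicts \(y\), because leaving it out pushes it into the fixed-variance residual.

```r
# Adding a Z independent of X still changes the logit coefficient of X:
# omitting Z attenuates beta_x toward 0 (a rescaling, not confounding).
set.seed(7)
n <- 50000
x <- rnorm(n)
z <- rnorm(n)                          # unrelated to x by construction
y <- rbinom(n, 1, plogis(0.5 * x + 1.5 * z))
coef(glm(y ~ x,     family = binomial))["x"]  # attenuated, well below 0.5
coef(glm(y ~ x + z, family = binomial))["x"]  # close to the true 0.5
```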

Mediation in logistic regression

How serious is the problem?

  • Huge according to some, for instance Mood (2009)

  • It depends, or not that problematic, according to others, for instance Kuha & Mills (2018)

  • Is your dependent variable binary or are you interested in a latent variable?

  • What is your estimand?

Approaches to address the scaling issue:

  • Marginal effects
  • Linear Probability Model
  • Karlson-Holm-Breen (KHB) method

KHB method

Compare coefficients between two models

  1. Full model: X and the mediator Z
  2. Adjusted model: X and residualized Z
    where Z is residualized with respect to X

Both models have the same fit and same residual variance

Compare the coefficient of X between the two models for the true impact of mediation of X by Z

Also called rescaling. There are khb packages for Stata and R, but it is easy to do by hand

We want but can’t compare

nlm_reduced <- glm(smoker ~ primsec, 
              family=binomial(link='logit'), data = dfclean) 
nlm_full <- glm(smoker ~ primsec + age, 
              family=binomial(link='logit'), data = dfclean) 
tidy(nlm_reduced)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -1.42     0.0512    -27.8  4.27e-170
2 primsec        0.410    0.0641      6.39 1.61e- 10
tidy(nlm_full)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -0.0594   0.107      -0.556 5.78e- 1
2 primsec       0.659    0.0674      9.78  1.40e-22
3 age          -0.0300   0.00213   -14.1   6.08e-45

We want but can’t compare

models <- list(nlm_reduced,nlm_full)
modelsummary(models, fmt = 2,
             statistic = NULL,
             estimate = "{estimate} ({std.error}) {p.value}",
             gof_map = c("nobs", "aic", "bic"))
(1) (2)
(Intercept) -1.42 (0.05) <0.01 -0.06 (0.11) 0.58
primsec 0.41 (0.06) <0.01 0.66 (0.07) <0.01
age -0.03 (0.00) <0.01
Num.Obs. 5874 5874
AIC 6388.0 6176.7
BIC 6401.4 6196.7

ORs for primsec of exp(0.41) = 1.507 and exp(0.66) = 1.932, but the two models are on different scales

KHB: Residualise Z

rlm <- lm(age ~ primsec, data = dfclean) # OLS
dfclean$age_hat <- predict(rlm) # predicted value
dfclean$r_age <- dfclean$age - dfclean$age_hat # residual
cor(dfclean$r_age, dfclean$primsec)
[1] -8.755153e-15
cor(dfclean$age, dfclean$primsec)
[1] 0.2577669
nlm_adj <- glm(smoker ~ primsec + r_age, 
              family=binomial(link='logit'), data = dfclean) 
tidy(nlm_adj)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -1.48     0.0523     -28.2  3.52e-175
2 primsec       0.405    0.0651       6.22 5.03e- 10
3 r_age        -0.0300   0.00213    -14.1  6.08e- 45

Compare

tidy(nlm_adj)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -1.48     0.0523     -28.2  3.52e-175
2 primsec       0.405    0.0651       6.22 5.03e- 10
3 r_age        -0.0300   0.00213    -14.1  6.08e- 45
tidy(nlm_full)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -0.0594   0.107      -0.556 5.78e- 1
2 primsec       0.659    0.0674      9.78  1.40e-22
3 age          -0.0300   0.00213   -14.1   6.08e-45

Let’s use modelsummary to make it a bit cleaner

models <- list(nlm_reduced, nlm_full, nlm_adj)
modelsummary(models, fmt = 2,
             statistic = NULL,
             estimate = "{estimate} ({std.error}) {p.value}",
             gof_map = c("nobs", "aic", "bic"))

Compare

(1) (2) (3)
(Intercept) -1.42 (0.05) <0.01 -0.06 (0.11) 0.58 -1.48 (0.05) <0.01
primsec 0.41 (0.06) <0.01 0.66 (0.07) <0.01 0.40 (0.07) <0.01
age -0.03 (0.00) <0.01
r_age -0.03 (0.00) <0.01
Num.Obs. 5874 5874 5874
AIC 6388.0 6176.7 6176.7
BIC 6401.4 6196.7 6196.7

By hand or with the KHB package:

# difference by hand:
(nlm_adj$coefficients[2]-nlm_full$coefficients[2])
   primsec 
-0.2537485 
#install.packages("fmsb")
#install.packages("khb", repos="http://R-Forge.R-project.org")
library(khb)
k <- khb(nlm_reduced, nlm_full)

Models now have equal fit/residual variance

print(k, type = "models")
            Reduced      Adjusted     Full        
(Intercept) -1.42255 *** -1.47509 *** -0.05936    
            (0.051170)   (0.052275)   (0.106759)  
primsec      0.40986 ***  0.40487 ***  0.65862 ***
            (0.064095)   (0.065112)   (0.067359)  
Resid(age)               -0.03000 ***             
                         (0.002133)               
age                                   -0.03000 ***
                                      (0.002133)  
                                                  
Pseudo R2   0.010664     0.06391      0.06391     
Dev.        6384         6170.7       6170.7      
Null        6425.8       6425.8       6425.8      
Chisq       41.809       255.15       255.15      
Sig         ***          ***          ***         
Dl          1            2            2           
BIC         6392.7       6188         6188        

Test the change in \(\beta\)

print(k, disentangle = TRUE)
KHB method
Model type: glm lm (logit) 
Variables of interest: primsec
Z variables (mediators): age

Summary of confounding
            Ratio Percentage Rescaling
primsec   0.61473  -62.67385    0.9878
------------------------------------------------------------------
primsec :
         Estimate Std. Error  z value  Pr(>|z|)    
Reduced  0.404871   0.065112   6.2181 5.033e-10 ***
Full     0.658620   0.067359   9.7777 < 2.2e-16 ***
Diff    -0.253748   0.021896 -11.5890 < 2.2e-16 ***

Even easier method

## estimate full logit model and save predicted logodds as V
dfclean$v <- predict(nlm_full, type = "link")
## now fit the substantive models as OLS with V as dependent
rkhb_full <- lm(v ~ primsec + age, data = dfclean)
summary(rkhb_full)  # look at that R^2! :-)

Call:
lm(formula = v ~ primsec + age, data = dfclean)

Residuals:
       Min         1Q     Median         3Q        Max 
-8.759e-13  3.000e-17  1.700e-16  3.600e-16  8.500e-16 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -5.936e-02  5.244e-16 -1.132e+14   <2e-16 ***
primsec      6.586e-01  3.244e-16  2.031e+15   <2e-16 ***
age         -3.000e-02  9.886e-18 -3.035e+15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.184e-14 on 5871 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 5.44e+30 on 2 and 5871 DF,  p-value: < 2.2e-16
rkhb_adjusted <- lm(v ~ primsec, data = dfclean)
rkhb_adjusted$coefficients[2]
  primsec 
0.4048714 

Multinomial logistic regression

Multinomial logistic regression

Binomial logistic regression:

\[logit(p) = log \frac {p} {1-p} = \beta_0 + \beta_1x_1 + ... + \beta_kx_k\]

\(k\) independent variables

Multinomial logistic regression:

\[log \frac {P(y_i = m)} {P(y_i = ref)} = \beta_{0m} + \beta_{1m}x_1 + ... + \beta_{km}x_k\]

a regression model with \(k\) independent variables, used when \(y\) has \(M\) categories with \(M > 2\): one such equation for every category \(m\) other than the reference

\[p(y_i = 1) + p(y_i = 2) + ... + p(y_i = M) = 1 \]
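The constraint that the category probabilities sum to 1 holds for every case by construction. A minimal sketch with simulated data (hypothetical, not the marital-status example that follows):

```r
# Multinomial predicted probabilities sum to 1 for every observation
library(nnet)
set.seed(4)
n <- 300
x <- rnorm(n)
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
m <- multinom(g ~ x, trace = FALSE)
p <- predict(m, type = "probs")  # n x 3 matrix of probabilities
range(rowSums(p))                # every row sums to 1
```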

Multinomial logistic regression

dfclean$marital <- 1
dfclean[dfclean$divsep==1,]$marital <- 2
dfclean[dfclean$widow==1,]$marital <- 3
dfclean[dfclean$nevermar==1,]$marital <- 4
dfclean$marital <- as.factor(dfclean$marital)
levels(dfclean$marital) <- c("married","div/sep","widow", "single")
prop.table(table(dfclean$marital))

   married    div/sep      widow     single 
0.64589717 0.11201907 0.09805924 0.14402451 

Four outcomes, a system of three equations:

\[log \frac {P(marstat = divsep)} {P(marstat = married)} = \beta_{10} + \beta_{11} \, age + \beta_{12} \, female + \beta_{13} \, primsec\]

\[log \frac {P(marstat = widow)} {P(marstat = married)} = \beta_{20} + \beta_{21} \, age + \beta_{22} \, female + \beta_{23} \, primsec\]

\[log \frac {P(marstat = single)} {P(marstat = married)} = \beta_{30} + \beta_{31} \, age + \beta_{32} \, female + \beta_{33} \, primsec\]

Multinomial logistic regression

library(nnet)
mnom <- multinom(marital ~ age + female + primsec, data =dfclean, trace = FALSE)
summary(mnom)
Call:
multinom(formula = marital ~ age + female + primsec, data = dfclean, 
    trace = FALSE)

Coefficients:
        (Intercept)         age     female      primsec
div/sep   -1.386832 -0.01222036  0.4430925 -0.011088953
widow    -10.615869  0.11566742  1.3578659  0.575033172
single     1.596513 -0.06436850 -0.2985046 -0.003461308

Std. Errors:
        (Intercept)         age     female    primsec
div/sep   0.1659574 0.003049623 0.08790538 0.08756026
widow     0.3690083 0.004810113 0.12152906 0.13760496
single    0.1563999 0.003432239 0.07987463 0.08058662

Residual Deviance: 10201.01 
AIC: 10225.01 
exp(coef(mnom)['div/sep','female'])
[1] 1.557516
exp(coef(mnom)['div/sep','primsec'])
[1] 0.9889723
exp(coef(mnom)['widow','female'])
[1] 3.887887

Multinomial logistic regression

tidy(mnom) %>% filter(term == "female") %>% 
  mutate(OR = exp(estimate),
  lowerci = exp(estimate-1.96*std.error),
  upperci = exp(estimate+1.96*std.error)) 
# A tibble: 3 × 9
  y.level term   estimate std.error statistic  p.value    OR lowerci upperci
  <chr>   <chr>     <dbl>     <dbl>     <dbl>    <dbl> <dbl>   <dbl>   <dbl>
1 div/sep female    0.443    0.0879      5.04 4.64e- 7 1.56    1.31    1.85 
2 widow   female    1.36     0.122      11.2  5.52e-29 3.89    3.06    4.93 
3 single  female   -0.299    0.0799     -3.74 1.86e- 4 0.742   0.634   0.868

Or cleaned up:

# A tibble: 3 × 4
  y.level    OR lowerci upperci
  <chr>   <dbl>   <dbl>   <dbl>
1 div/sep 1.56    1.31    1.85 
2 widow   3.89    3.06    4.93 
3 single  0.742   0.634   0.868

modelsummary makes this much easier:

modelsummary(mnom, shape = term ~ response:model, fmt = 2,
             statistic = 'conf.int', conf_level = 0.95, exponentiate = TRUE)

Multinomial logistic regression

(1)
div/sep widow single
(Intercept) 0.25 0.00 4.94
[0.18, 0.35] [0.00, 0.00] [3.63, 6.71]
age 0.99 1.12 0.94
[0.98, 0.99] [1.11, 1.13] [0.93, 0.94]
female 1.56 3.89 0.74
[1.31, 1.85] [3.06, 4.93] [0.63, 0.87]
primsec 0.99 1.78 1.00
[0.83, 1.17] [1.36, 2.33] [0.85, 1.17]
Num.Obs. 5874
R2 0.161
R2 Adj. 0.160
AIC 10225.0
BIC 10305.2
RMSE 0.34

Switch the reference

dfclean$marital <- relevel(dfclean$marital, ref = "div/sep")
mnom <- multinom(marital ~ age + female + primsec, data =dfclean, trace = FALSE)
modelsummary(mnom, shape = term ~ response:model, fmt = 2,
             statistic = 'conf.int', conf_level = 0.95, exponentiate = TRUE)
(1)
married widow single
(Intercept) 4.00 0.00 19.75
[2.89, 5.54] [0.00, 0.00] [13.13, 29.72]
age 1.01 1.14 0.95
[1.01, 1.02] [1.12, 1.15] [0.94, 0.96]
female 0.64 2.50 0.48
[0.54, 0.76] [1.88, 3.31] [0.39, 0.59]
primsec 1.01 1.80 1.01
[0.85, 1.20] [1.32, 2.44] [0.82, 1.25]
Num.Obs. 5874
R2 0.161
R2 Adj. 0.160
AIC 10225.0
BIC 10305.2
RMSE 0.34

Plot the results

mn2 <- multinom(marital ~ age + female, data =dfclean, trace = FALSE)
tidy(mn2)
# A tibble: 9 × 6
  y.level term        estimate std.error statistic   p.value
  <chr>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 married (Intercept)   1.39     0.165        8.39 4.93e- 17
2 married age           0.0123   0.00298      4.13 3.70e-  5
3 married female       -0.442    0.0876      -5.05 4.50e-  7
4 widow   (Intercept)  -9.07     0.390      -23.3  8.39e-120
5 widow   age           0.132    0.00543     24.3  5.37e-130
6 widow   female        0.955    0.143        6.66 2.71e- 11
7 single  (Intercept)   2.98     0.207       14.4  7.36e- 47
8 single  age          -0.0521   0.00421    -12.4  3.48e- 35
9 single  female       -0.740    0.108       -6.85 7.45e- 12

modelsummary(mn2, shape = term ~ response:model, fmt = 2,
             statistic = 'conf.int', conf_level = 0.95, exponentiate = TRUE)

Plot the results

(1)
married widow single
(Intercept) 4.01 0.00 19.75
[2.90, 5.54] [0.00, 0.00] [13.15, 29.65]
age 1.01 1.14 0.95
[1.01, 1.02] [1.13, 1.15] [0.94, 0.96]
female 0.64 2.60 0.48
[0.54, 0.76] [1.96, 3.44] [0.39, 0.59]
Num.Obs. 5874
R2 0.159
R2 Adj. 0.159
AIC 10237.7
BIC 10297.8
RMSE 0.34

Plot the results

dm <- data.frame(female = rep(c(0,1), each=41), age = rep(c(30:70),2))
pdm <- cbind(dm, predict(mn2, newdata = dm, type = "probs", se = TRUE))
library(reshape2)
lpdm <- melt(pdm, id.vars = c("female", "age"), value.name = "probability")
ggplot(lpdm, aes(x = age, y = probability, 
                 colour = variable, linetype = as.factor(female))) +
  geom_line() +
  theme_minimal()

Multinomial logit models

  • Only if you really need it
  • Predicted probabilities usually necessary to make sense of the outcomes
  • Large sample size needed
  • Independence of Irrelevant Alternatives (IIA) assumption
  • Use separate logits to detect outliers or influential cases
  • Contrast test in the marginaleffects package

Main take-aways

  1. What is the aim of your model comparisons?
  2. Which marginal effects do you need to show to answer your research question?
  3. Use KHB method when needed
  4. Trade-offs of multinomial logistic regression

Preparing for next week:

What you need to cover before the lecture and lab:

  • revise basics of OLS, variance components, interactions

  • practise in R

  • read the relevant part of Snijders & Bosker!

  • Think about applications of multilevel models in your own research and areas of interest

Thanks!